31 results for random forest data analysis

in DigitalCommons@The Texas Medical Center


Relevance: 100.00%

Abstract:

Most statistical analysis, theory and practice, is concerned with static models: models with a proposed set of parameters whose values are fixed across observational units. Static models implicitly assume that the quantified relationships remain the same across the design space of the data. While this is reasonable under many circumstances, it can be a dangerous assumption when dealing with sequentially ordered data. The mere passage of time always brings fresh considerations, and the interrelationships among parameters, or subsets of parameters, may need to be continually revised.

When data are gathered sequentially, dynamic interim monitoring may be useful as new subject-specific parameters are introduced with each new observational unit. Sequential imputation via dynamic hierarchical models is an efficient strategy for handling missing data and analyzing longitudinal studies. Dynamic conditional independence models offer a flexible framework that exploits the Bayesian updating scheme for capturing the evolution of both population and individual effects over time. While static models often describe aggregate information well, they frequently fail to reflect conflicts in the information at the individual level. Dynamic models prove advantageous over static models in capturing both individual and aggregate trends. Computations for such models can be carried out via the Gibbs sampler. An application to a small-sample, repeated-measures, normally distributed growth-curve data set is presented.
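The computational core here is the Gibbs sampler. As a rough illustration only (the dissertation's exact dynamic hierarchical specification is not given in this abstract), the sketch below runs a conjugate Gibbs sampler for a random-intercept normal model on simulated repeated-measures data; the variable names and the fixed variance components are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated small-sample repeated-measures data (hypothetical stand-in
# for the growth-curve application): 10 subjects, 4 visits each.
n_subj, n_obs = 10, 4
true_mu, sigma_b, sigma_e = 50.0, 3.0, 2.0
b = rng.normal(0, sigma_b, n_subj)
y = true_mu + b[:, None] + rng.normal(0, sigma_e, (n_subj, n_obs))

# Gibbs sampler for the hierarchical normal model
#   y_ij ~ N(mu + b_i, sigma_e^2),  b_i ~ N(0, sigma_b^2)
# with a flat prior on mu and fixed variances for brevity.
mu, bi = 0.0, np.zeros(n_subj)
draws = []
for it in range(2000):
    # b_i | rest: conjugate normal update
    prec = n_obs / sigma_e**2 + 1 / sigma_b**2
    mean = (y - mu).sum(axis=1) / sigma_e**2 / prec
    bi = rng.normal(mean, np.sqrt(1 / prec))
    # mu | rest: normal update given the current random intercepts
    mu = rng.normal((y - bi[:, None]).mean(), sigma_e / np.sqrt(y.size))
    if it >= 500:                      # discard burn-in
        draws.append(mu)

print(f"posterior mean of mu: {np.mean(draws):.2f}")  # ~ true_mu
```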

Relevance: 100.00%

Abstract:

Microarray technology is a high-throughput method for genotyping and gene expression profiling. Limited sensitivity and specificity are among the essential problems for this technology. Most existing methods of microarray data analysis have an apparent limitation: they deal only with the numerical part of microarray data and make little use of gene sequence information. Because it is the gene sequences that precisely define the physical objects being measured by a microarray, it is natural to make the gene sequences an essential part of the data analysis. This dissertation focused on the development of free-energy models to integrate sequence information into microarray data analysis. The models were used to characterize the mechanism of hybridization on microarrays and to enhance the sensitivity and specificity of microarray measurements.

Cross-hybridization is a major obstacle to the sensitivity and specificity of microarray measurements. In this dissertation, we evaluated the scope of the cross-hybridization problem on short-oligo microarrays. The results showed that cross-hybridization on arrays is mostly caused by oligo fragments with a run of 10 to 16 nucleotides complementary to the probes. Furthermore, a free-energy-based model was proposed to quantify the amount of cross-hybridization signal on each probe. This model treats cross-hybridization as an integral effect of the interactions between a probe and various off-target oligo fragments. Using public spike-in datasets, the model showed high accuracy in predicting the cross-hybridization signals on probes whose intended targets are absent from the sample.

Several prospective models were proposed to improve the Positional-Dependent Nearest-Neighbor (PDNN) model for better quantification of gene expression and cross-hybridization.

The problem addressed in this dissertation is fundamental to microarray technology. We expect that this study will help us to understand the detailed mechanism that determines sensitivity and specificity on microarrays. Consequently, this research will have a wide impact on how microarrays are designed and how the data are interpreted.
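To make the free-energy idea concrete, here is a minimal sketch of nearest-neighbor duplex free-energy scoring with a position-dependent weighting in the spirit of PDNN. The stacking parameters are approximate values in the style of SantaLucia's unified DNA/DNA set, and the weighting profile is invented for illustration; neither reproduces the dissertation's fitted model.

```python
import numpy as np

# Approximate nearest-neighbor free-energy parameters (kcal/mol, 37 C),
# in the spirit of SantaLucia's unified DNA/DNA set; values here are
# illustrative and should be replaced with a vetted table for real use.
NN_DG = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
    "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84,
}
COMP = str.maketrans("ACGT", "TGCA")

def stack_dg(dinuc: str) -> float:
    # Complementary stacks share parameters (e.g. TT is read as AA).
    rc = dinuc.translate(COMP)[::-1]
    return NN_DG.get(dinuc, NN_DG.get(rc, 0.0))

def duplex_dg(probe: str, weights=None) -> float:
    """Sum stacking free energies over the probe; optional positional
    weights mimic the position-dependent idea behind PDNN."""
    stacks = [probe[i:i + 2] for i in range(len(probe) - 1)]
    w = np.ones(len(stacks)) if weights is None else np.asarray(weights)
    return float(sum(wi * stack_dg(s) for wi, s in zip(w, stacks)))

probe = "ACGTACGTACGTACGTACGTACGTA"      # hypothetical 25-mer probe
# Center-weighted profile: interior stacks contribute more, as in PDNN.
pos = np.linspace(-1, 1, len(probe) - 1)
weights = 1.0 - 0.5 * pos**2
print(f"weighted duplex dG ~ {duplex_dg(probe, weights):.1f} kcal/mol")
```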

Relevance: 100.00%

Abstract:

Introduction. Food frequency questionnaires (FFQs) are used to study the association between dietary intake and disease. An instructional video may offer a low-cost, practical method of dietary assessment training for participants, thereby reducing recall bias in FFQs. There is little evidence in the literature on the effect of instructional videos on FFQ-based intake.

Objective. This analysis compared the reported energy and macronutrient intake of two groups that were randomized either to watch an instructional video before completing an FFQ or to view the same video after completing the same FFQ.

Methods. In the parent study, a diverse group of students, faculty and staff from Houston Community College were randomized to two groups, stratified by ethnicity, and completed an FFQ. The "video before" group watched an instructional video about completing the FFQ prior to answering the FFQ. The "video after" group watched the instructional video after completing the FFQ. The two groups were compared on mean daily energy (kcal/day), fat (g/day), protein (g/day), carbohydrate (g/day) and fiber (g/day) intakes using descriptive statistics and one-way ANOVA. Demographic, height, and weight information was collected. Dietary intakes were adjusted for total energy intake before the comparative analysis. BMI and age were ruled out as potential confounders.

Results. There were no significant differences between the two groups in mean daily dietary intakes of energy, total fat, protein, carbohydrates and fiber. However, a pattern of higher energy intake and lower fiber intake was reported in the group that viewed the instructional video before completing the FFQ compared with those who viewed it after.

Discussion. Analysis of the difference in reported intake of energy and macronutrients showed an overall pattern, albeit not statistically significant, of higher intake in the video-before versus the video-after group. Further research is needed to address the validity of reported dietary intakes in those randomized to watch an instructional video before reporting diet compared with a control group that does not view a video.
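For the two-arm design described above, one-way ANOVA reduces to a simple two-group comparison. A minimal sketch, with simulated intakes standing in for the FFQ data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical energy-adjusted daily intakes (kcal/day) for the two
# arms; stand-ins for the FFQ data, which the abstract does not give.
video_before = rng.normal(2100, 450, 60)
video_after = rng.normal(2000, 450, 60)

# One-way ANOVA with two groups is equivalent to a two-sample t-test.
f_stat, p_val = stats.f_oneway(video_before, video_after)
print(f"F = {f_stat:.2f}, p = {p_val:.3f}")
```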

Relevance: 100.00%

Abstract:

Next-generation DNA sequencing platforms can effectively detect the entire spectrum of genomic variation and are emerging as a major tool for systematic exploration of the universe of variants and interactions in the entire genome. However, the data produced by next-generation sequencing technologies suffer from three basic problems: sequence errors, assembly errors, and missing data. Current statistical methods for genetic analysis are well suited to detecting the association of common variants but are less suitable for rare variants. This poses a great challenge for sequence-based genetic studies of complex diseases.

This dissertation used the genome continuum model as a general principle, and stochastic calculus and functional data analysis as tools, to develop novel and powerful statistical methods for the next generation of association studies of both qualitative and quantitative traits in the context of sequencing data, ultimately shifting the paradigm of association analysis from the current locus-by-locus approach to collectively analyzing genome regions.

In this project, functional principal component (FPC) methods coupled with high-dimensional data reduction techniques were used to develop novel and powerful methods for testing the associations of the entire spectrum of genetic variation within a segment of genome or a gene, regardless of whether the variants are common or rare.

Classical quantitative genetics suffers from high type I error rates and low power for rare variants. To overcome these limitations for resequencing data, this project used functional linear models with scalar response to develop statistics for identifying quantitative trait loci (QTLs) for both common and rare variants. To illustrate their applications, the functional linear models were applied to five quantitative traits in the Framingham Heart Study.

This project also proposed a novel concept of gene-gene co-association, in which a gene or a genomic region is taken as the unit of association analysis, and used stochastic calculus to develop a unified framework for testing the association of multiple genes or genomic regions for both common and rare alleles. The proposed methods were applied to gene-gene co-association analysis of psoriasis in two independent GWAS datasets, leading to the discovery of networks significantly associated with psoriasis.
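As a rough sketch of the FPC idea (the dissertation's functional-data machinery is more elaborate), genotypes across a gene region can be treated as functions of genomic position, summarized by principal component scores, and tested jointly against a trait. All data below are simulated, and the reduction to ordinary PCA assumes evenly spaced variant sites:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical genotype matrix: n subjects x m variant sites in one
# gene (0/1/2 minor-allele counts), plus a quantitative trait.
n, m = 500, 40
G = rng.binomial(2, 0.05, size=(n, m)).astype(float)
trait = 0.4 * G[:, 5] + rng.normal(0, 1, n)

# Treat each subject's genotypes across the region as a "function" of
# position and extract functional principal component scores; with
# evenly spaced sites this reduces to PCA of the centered matrix.
Gc = G - G.mean(axis=0)
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
k = 3                              # retain the top-k components
scores = U[:, :k] * s[:k]          # FPC scores per subject

# Test the trait against the k FPC scores jointly via an OLS F-test.
X = np.column_stack([np.ones(n), scores])
beta = np.linalg.lstsq(X, trait, rcond=None)[0]
rss1 = float(((trait - X @ beta) ** 2).sum())
rss0 = float(((trait - trait.mean()) ** 2).sum())
F = ((rss0 - rss1) / k) / (rss1 / (n - k - 1))
p = stats.f.sf(F, k, n - k - 1)
print(f"region-level F = {F:.2f}, p = {p:.3g}")
```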

Relevance: 100.00%

Abstract:

Introduction. Despite the ban on lead-containing gasoline and paint, childhood lead poisoning remains a public health issue. Furthermore, a Medicaid-eligible child is 8 times more likely to have an elevated blood lead level (EBLL) than a non-Medicaid child, which is the primary reason for the early-detection lead screening mandate at ages 12 and 24 months among the Medicaid population. Based on field observations, there was evidence suggesting a screening compliance issue.

Objective. The purpose of this study was to analyze blood lead screening compliance in previously lead-poisoned Medicaid children and to test for an association between timely lead screening and timely childhood immunizations. The mean months between follow-up tests were also examined for a significant difference between non-compliant and compliant lead-screened children.

Methods. Access to the surveillance data of all childhood lead-poisoned cases in Bexar County was granted by the San Antonio Metropolitan Health District. A database was constructed and analyzed using descriptive statistics, logistic regression methods and non-parametric tests. Lead screening at 12 months of age was analyzed separately from lead screening at 24 months. The small portion of the population who were related was included in one analysis and removed from a second analysis to check for significance. Gender, ethnicity, age of home, and having a sibling with an EBLL were ruled out as confounders for the association tests, but ethnicity and age of home were adjusted for in the nonparametric tests.

Results. There was a strong significant association between lead screening compliance at 12 months and childhood immunization compliance, with or without related children included (p<0.00). However, there was no significant association between the two variables at the age of 24 months. Furthermore, there was no significant difference in the median of the mean months between follow-up blood tests among the non-compliant and compliant lead-screened population for the 12-month screening group, but there was a significant difference for the 24-month screening group (p<0.01).

Discussion. Descriptive statistics showed that 61% and 56% of the previously lead-poisoned Medicaid population did not receive their 12- and 24-month mandated lead screening on time, respectively, suggesting that their elevated blood lead levels might otherwise have been diagnosed earlier in childhood. Furthermore, a child who is compliant with lead screening at 12 months of age is 2.36 times more likely to also receive childhood immunizations on time than a child who was not compliant with the 12-month screening. Even though no statistically significant association was found for the 24-month group, the public health significance of a screening compliance issue is no less important. The Texas Medicaid program needs to enforce lead screening compliance, because it is evident that there has been no monitoring system in place. Further recommendations include an increased focus on parental education and on the importance of taking children for wellness exams on time.
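The association tests described above are logistic regressions whose exponentiated coefficients give adjusted odds ratios (for example, the reported 2.36 for 12-month compliance). A minimal sketch on simulated data, with hypothetical column names, using statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Hypothetical analysis file: one row per previously lead-poisoned
# child; all column names and values are invented for illustration.
n = 400
df = pd.DataFrame({
    "immunized_on_time": rng.integers(0, 2, n),
    "screened_12mo": rng.integers(0, 2, n),
    "hispanic": rng.integers(0, 2, n),
    "home_pre1950": rng.integers(0, 2, n),
})

# Logistic regression of timely immunization on timely 12-month lead
# screening, adjusting for covariates; exponentiated coefficients are
# adjusted odds ratios.
fit = smf.logit(
    "immunized_on_time ~ screened_12mo + hispanic + home_pre1950",
    data=df,
).fit(disp=False)
ors = np.exp(fit.params).rename("OR")
ci = np.exp(fit.conf_int()).rename(columns={0: "2.5%", 1: "97.5%"})
print(pd.concat([ors, ci], axis=1))
```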

Relevance: 100.00%

Abstract:

The need for timely population data for health planning and indicators of need has increased the demand for population estimates. The data required to produce estimates are difficult to obtain and the process is time-consuming. Estimation methods that require less effort and fewer data are needed. The structure-preserving estimator (SPREE) is a promising technique not previously used to estimate county population characteristics. This study first uses traditional regression estimation techniques to produce estimates of county population totals. Then the structure-preserving estimator, using the results produced in the first phase as constraints, is evaluated.

Regression methods are among the most frequently used demographic methods for estimating populations. These methods use symptomatic indicators to predict population change. This research evaluates three regression methods to determine which produces the best estimates based on the 1970-to-1980 indicators of population change. Strategies for stratifying data to improve the ability of the methods to predict change were tested. Difference-correlation using PMSA strata produced the equation that fit the data best. Regression diagnostics were used to evaluate the residuals.

The second phase of this study evaluates the use of the structure-preserving estimator in making estimates of population characteristics. The SPREE approach uses existing data (the association structure) to establish the relationship between the variable of interest and the associated variable(s) at the county level. Marginals at the state level (the allocation structure) supply the current relationship between the variables. The full allocation structure model uses current estimates of county population totals to limit the magnitude of county estimates; the limited full allocation structure model has no constraints on county size. The 1970 county census age-gender population provides the association structure; the allocation structure is the 1980 state age-gender distribution.

The full allocation model produces good estimates of the 1980 county age-gender populations. An unanticipated finding of this research is that the limited full allocation model produces estimates of county population totals that are superior to those produced by the regression methods. The full allocation model is used to produce estimates of 1986 county population characteristics.
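SPREE-type estimates can be computed by iterative proportional fitting: the association-structure table is scaled to the allocation-structure marginals, preserving its interaction structure. A toy sketch with invented counts (three counties, four age-gender cells):

```python
import numpy as np

# Association structure: hypothetical 1970 county x age-gender counts.
assoc = np.array([
    [1200.,  900., 1100.,  800.],
    [ 400.,  350.,  420.,  300.],
    [2500., 2100., 2400., 1900.],
])

# Allocation structure: current state-level totals per age-gender cell
# (column marginals), plus current county totals (row marginals) for
# the "full allocation" variant described in the abstract.
col_targets = np.array([4600., 3900., 4400., 3500.])
row_targets = np.array([4500., 1600., 10300.])   # sums must agree

# Structure-preserving estimation via iterative proportional fitting:
# scale rows and columns in turn until both marginals are matched,
# preserving the interaction structure of the 1970 table.
est = assoc.copy()
for _ in range(100):
    est *= (row_targets / est.sum(axis=1))[:, None]
    est *= col_targets / est.sum(axis=0)
print(np.round(est, 0))
```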

Relevance: 100.00%

Abstract:

Approximately 795,000 new and recurrent strokes occur each year. Because of the resulting functional impairment, stroke survivors are often discharged into the care of a family caregiver, most often their spouse. This dissertation explored the effect that mutuality, a measure of the perceived positive aspects of the caregiving relationship, had on the stress and depression of 159 stroke survivors and their spousal caregivers over the first 12 months after discharge from inpatient rehabilitation. Specifically, cross-lagged regression was used to investigate the dyadic, longitudinal relationship between caregiver and stroke survivor mutuality and caregiver and stroke survivor stress over time. Longitudinal mediational analysis was employed to examine the mediating effect of mutuality on the dyads' perception of family function and on caregiver and stroke survivor depression over time.

Caregivers' mutuality was found to be associated with their own stress over time but not with the stress of the stroke survivor: caregivers who had higher mutuality scores over the 12 months of the study had lower perceived stress. Additionally, a partner effect of stress was found for the stroke survivor but not the caregiver, indicating that stroke survivors' stress over time was associated with caregivers' stress, but caregivers' stress over time was not significantly associated with the stress of the stroke survivor.

This dissertation did not find mutuality to mediate the relationship between caregivers' and stroke survivors' perception of family function at baseline and their own or their partners' depression at 12 months, as hypothesized. However, caregivers who perceived healthier family functioning at baseline, and stroke survivors who had higher perceived mutuality at 12 months, had lower depression at one year post discharge from inpatient rehabilitation. Additionally, caregiver mutuality at 6 months, but not at baseline or 12 months, was found to be inversely related to caregiver depression at 12 months.

These findings highlight the interpersonal nature of stress in the context of caregiving, especially within spousal relationships. Health professionals should therefore encourage caregivers and stroke survivors to focus on the positive aspects of the caregiving relationship in order to mitigate stress and depression.

Relevance: 100.00%

Abstract:

Individuals with disabilities face numerous barriers to participation due to biological and physical characteristics of the disability as well as social and environmental factors. Participation can be affected at all levels, from societal engagement to activities of daily living, exercise, education, and interpersonal relationships. This study evaluated the impact of pain, mood, depression, quality of life and fatigue on participation for individuals with mobility impairments. This cross-sectional study draws on self-report data collected from a wheelchair-using sample. Bivariate correlational and multivariate analyses were employed to examine the relationship of pain, quality of life, positive and negative mood, fatigue, and depression with participation while controlling for relevant socio-demographic variables (sex, age, time with disability, race, and education). Results from the 122 respondents with mobility impairments showed that, after controlling for socio-demographic characteristics in the full model, 20% of the variance in participation scores was accounted for by pain, quality of life, positive and negative mood, and depression. Notably, quality of life emerged as the single variable significantly related to participation in the full model. Contrary to other studies, pain did not appear to significantly affect participation outcomes for wheelchair users in this sample. Participation is an emerging area of interest among rehabilitation and disability researchers, and the results of this study provide compelling evidence that several psychosocial factors are related to participation. This area of inquiry warrants further study, as many of the psychosocial variables identified here (mood, depression, quality of life) may be amenable to intervention, which may in turn positively influence participation.

Relevance: 100.00%

Abstract:

Autoimmune diseases are a group of inflammatory conditions in which the body's immune system attacks its own cells. There are over 80 diseases classified as autoimmune disorders, affecting up to 23.5 million Americans. Obesity affects 32.3% of the US adult population and could also be considered an inflammatory condition, as indicated by the presence of chronic low-grade inflammation. C-reactive protein (CRP) is a marker of inflammation and is associated with both adiposity and autoimmune inflammation. This study sought to determine the cross-sectional association between obesity and autoimmune diseases in a large, nationally representative population derived from NHANES 2009-10 data, and the role CRP might play in this relationship. Overall, the results showed that individuals with autoimmune disease were 2.11 times more likely to report being overweight than individuals without autoimmune disease, and that CRP had a mediating effect on the obesity-autoimmune relationship.
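A minimal sketch of the mediation logic on simulated NHANES-like variables, using the product-of-coefficients approach (paths a and b); this illustrates the generic method, not the study's actual estimation procedure:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

# Simulated stand-ins for the variables described in the abstract.
n = 2000
obese = rng.integers(0, 2, n)
crp = 1.0 + 0.8 * obese + rng.gamma(2.0, 1.0, n)      # mg/L, skewed
logit_p = -2.0 + 0.3 * obese + 0.25 * crp
autoimmune = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))
df = pd.DataFrame({"obese": obese, "crp": crp, "autoimmune": autoimmune})

# Product-of-coefficients mediation sketch (a*b, log-odds scale):
# path a: obesity -> CRP; path b: CRP -> autoimmune, given obesity.
a = smf.ols("crp ~ obese", data=df).fit().params["obese"]
fit_b = smf.logit("autoimmune ~ obese + crp", data=df).fit(disp=False)
b = fit_b.params["crp"]
direct = fit_b.params["obese"]
print(f"indirect (a*b) = {a * b:.3f}, direct = {direct:.3f}")
# Formal inference would use bootstrapped CIs or a counterfactual
# mediation package rather than this point estimate alone.
```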

Relevance: 100.00%

Abstract:

Next-generation sequencing (NGS) technology has become a prominent tool in biological and biomedical research. However, NGS data analysis, such as de novo assembly, mapping and variant detection, is far from mature, and the high sequencing error rate is one of the major problems. To minimize the impact of sequencing errors, we developed a highly robust and efficient method, MTM, to correct the errors in NGS reads. We demonstrated the effectiveness of MTM on both single-cell data with highly non-uniform coverage and normal data with uniformly high coverage, showing that MTM's performance does not rely on the coverage of the sequencing reads. MTM was also compared with Hammer and Quake, the best methods for correcting non-uniform and uniform data respectively. For non-uniform data, MTM outperformed both Hammer and Quake. For uniform data, MTM showed better performance than Quake and comparable results to Hammer. By making better error corrections with MTM, the quality of downstream analysis, such as mapping and SNP detection, was improved. SNP calling is a major application of NGS technologies. However, the existence of sequencing errors complicates this process, especially at low coverage.
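MTM's algorithm is not described in this abstract, so the sketch below instead illustrates the generic k-mer-spectrum idea behind the comparison methods (Quake, Hammer): count k-mers across reads, treat low-count k-mers as suspect, and flip a base when a substitution makes more of its covering k-mers solid. All thresholds and helper names here are assumptions.

```python
from collections import Counter

K = 15
BASES = "ACGT"

def kmer_counts(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i + K]] += 1
    return counts

def correct_read(read, counts, solid=2):
    """Flip a base to the substitution that maximizes the number of
    'solid' (count >= solid) k-mers covering that position."""
    read = list(read)
    for i in range(len(read)):
        lo = max(0, i - K + 1)
        hi = min(len(read) - K + 1, i + 1)
        def score(b):
            old, read[i] = read[i], b
            s = sum(counts["".join(read[j:j + K])] >= solid
                    for j in range(lo, hi))
            read[i] = old
            return s
        orig = read[i]
        best = max(BASES, key=score)
        if score(best) > score(orig):
            read[i] = best
    return "".join(read)

# Toy usage: 30 clean copies plus one read with a single substitution.
reads = ["ACGTACGTACGTACGTACGT"] * 30 + ["ACGTACGTACGAACGTACGT"]
counts = kmer_counts(reads)
print(correct_read("ACGTACGTACGAACGTACGT", counts))  # error repaired
```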

Relevance: 100.00%

Abstract:

In this dissertation, we propose a continuous-time Markov chain model to examine longitudinal data whose outcome variable has three categories. The advantage of this model is that it permits a different number of measurements for each subject, and the duration between two consecutive measurement time points can be irregular. Using the maximum likelihood principle, we can estimate the transition probability between two time points. By using the information provided by the independent variables, the model can also estimate the transition probability for each subject. The Monte Carlo simulation method is used to investigate the goodness of model fit compared with that obtained from other models. A public health example is used to demonstrate the application of this method.
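The key computation is that, for a generator matrix Q, the transition-probability matrix over an arbitrary gap t is P(t) = expm(Qt), which is what accommodates irregularly spaced measurements. A sketch with a hypothetical three-state generator:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical generator (intensity) matrix Q for a three-category
# outcome; rows sum to zero, off-diagonals are transition intensities.
Q = np.array([
    [-0.30,  0.20,  0.10],
    [ 0.15, -0.40,  0.25],
    [ 0.05,  0.10, -0.15],
])

# P(t) = expm(Q * t) gives the transition probabilities over any gap,
# so unequally spaced visits pose no difficulty.
for t in (0.5, 1.0, 2.7):          # irregular intervals between visits
    P = expm(Q * t)
    print(f"t = {t}:\n{np.round(P, 3)}")
```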

Relevance: 100.00%

Abstract:

Brain tumors are among the most aggressive cancers in humans, with an estimated median survival time of 12 months and only 4% of patients surviving more than 5 years after diagnosis. Until recently, brain tumor prognosis was based only on clinical information such as tumor grade and patient age, but there are reports indicating that molecular profiling of gliomas can reveal subgroups of patients with distinct survival rates. We hypothesize that coupling molecular profiling of brain tumors with clinical information might improve predictions of patient survival time and, consequently, better guide future treatment decisions. To evaluate this hypothesis, the general goal of this research is to build models for survival prediction of glioma patients using DNA molecular profiles (U133 Affymetrix gene expression microarrays) along with clinical information.

First, a predictive Random Forest model is built for binary outcomes (i.e., short- vs. long-term survival) and a small subset of genes whose expression values can be used to predict survival time is selected. Next, a new statistical methodology is developed for predicting time-to-death outcomes using Bayesian ensemble trees. Because of the large heterogeneity observed within the prognostic classes obtained by the Random Forest model, prediction can be improved by relating time-to-death to the gene expression profile directly. We propose a Bayesian ensemble model for survival prediction that is appropriate for high-dimensional data such as gene expression data. Our approach is based on the ensemble "sum-of-trees" model, which is flexible enough to incorporate additive and interaction effects between genes. We specify a fully Bayesian hierarchical approach and illustrate our methodology for the CPH, Weibull, and AFT survival models. We overcome the lack of conjugacy using a latent-variable formulation to model the covariate effects, which decreases computation time for model fitting. Our proposed models also provide a model-free way to select important predictive prognostic markers based on controlling false discovery rates.

We compare the performance of our methods with baseline reference survival methods and apply our methodology to an unpublished data set of brain tumor survival times and gene expression data, selecting genes potentially related to the development of the disease under study. A closing discussion compares the results obtained by the Random Forest and Bayesian ensemble methods from biological/clinical perspectives and highlights the statistical advantages and disadvantages of the new methodology in the context of DNA microarray data analysis.
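The first modeling stage (a Random Forest on dichotomized survival plus importance-based gene selection) can be sketched as follows; the data are simulated stand-ins for the unpublished expression set, and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)

# Hypothetical stand-in for the brain-tumor data: expression of 5000
# probe sets for 100 patients, dichotomized survival (short vs. long).
n, p = 100, 5000
X = rng.normal(size=(n, p))
y = (X[:, :10].sum(axis=1) + rng.normal(0, 2, n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean().round(3))

# Rank genes by impurity-based importance and keep a small candidate
# panel, as in the first stage of the modeling pipeline.
rf.fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:25]
print("top candidate genes (column indices):", top[:10])
```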

Relevance: 100.00%

Abstract:

Random Forests™ is reported to be one of the most accurate classification algorithms for complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can handle a large number of weak input variables without overfitting, can account for non-additive interactions between input variables, and can be used for variable selection without being adversely affected by collinearities.

Random Forests can handle large-scale data sets without rigorous data preprocessing and has a robust variable importance ranking measure. We propose a novel variable selection method, in the context of Random Forests, that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments.

When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. This study suggests that Random Forests is recommended for such high-dimensionality data: one can use Random Forests to select the important variables and then use logistic regression, or Random Forests itself, to estimate the effect size of the predictors and to classify new observations.

We also found that the mean decrease in accuracy is a more reliable variable-ranking measure than the mean decrease in Gini.
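One way to realize the noise-level cutoff with a synthetic control, loosely following the thesis's idea, is to append permuted "shadow" copies of the predictors and keep only real predictors whose importance exceeds the best shadow importance. The sketch uses scikit-learn's Gini-based importances for brevity, whereas the thesis favors the permutation-based mean decrease in accuracy:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)

# Hypothetical case-control matrix: a few informative predictors
# buried in many noisy ones, as in the lung-cancer application.
n, p_signal, p_noise = 600, 5, 200
X_sig = rng.normal(size=(n, p_signal))
X_noi = rng.normal(size=(n, p_noise))
y = (X_sig.sum(axis=1) + rng.normal(0, 1.5, n) > 0).astype(int)
X = np.hstack([X_sig, X_noi])

# Synthetic control idea: append column-wise permuted ("shadow")
# copies; a real predictor worth keeping should out-rank the best
# shadow importance, which estimates the data's noise level.
shadow = rng.permuted(X, axis=0)
Xa = np.hstack([X, shadow])
rf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
rf.fit(Xa, y)
imp = rf.feature_importances_
cutoff = imp[X.shape[1]:].max()          # noise-level cut-off value
selected = np.where(imp[:X.shape[1]] > cutoff)[0]
print("selected predictors:", selected)  # ideally ~ [0..4]
```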

Relevance: 100.00%

Abstract:

Background. An enlarged tracheoesophageal puncture (TEP) results in aspiration around the voice prosthesis (VP) and may lead to pneumonia. The aims of this research were: (1) to conduct a systematic review and meta-analysis on enlarged TEP; (2) to analyze preoperative, perioperative, and postoperative risk factors for enlarged TEP; and (3) to evaluate control of leakage around the VP using conservative treatments, and adverse events, in patients with enlarged TEP.

Methods. A systematic review was conducted (1978-2008). A summary risk estimate was calculated using a random-effects meta-analysis model. A retrospective cohort study was also completed: patients who underwent total laryngectomy and TEP at The University of Texas M. D. Anderson Cancer Center (MDACC) were included. Multiple logistic regression methods were used to assess risk factors for enlargement. Descriptive and bivariate statistics were calculated to evaluate outcomes and adverse events.

Results. Twenty-seven manuscripts were included in the systematic review. The summary risk estimate of enlarged TEP/leakage around the VP was 7.2% (95% CI: 4.8%-9.6%). Temporary VP removal and TEP-site injections were the most commonly reported treatments. Neither prosthetic diameter (p=0.076) nor timing of TEP (p=0.297) significantly increased the risk of enlargement in stratified analyses of published outcomes. The cumulative incidence of enlarged TEP was 18.6% (36/194, 95% CI: 13.0%-24.1%) in the MDACC cohort. Enlarged TEP occurred exclusively in irradiated patients. Adjusting for length of follow-up and timing of TEP, advanced nodal disease (ORadjusted: 4.3, 95% CI: 1.0-19.1), stricture (ORadjusted: 3.2, 95% CI: 1.2-8.6), and locoregional recurrence/distant metastasis after laryngectomy (ORadjusted: 6.2, 95% CI: 2.3-16.4) increased the risk of enlarged TEP. At last follow-up, conservative methods controlled leakage around the VP in 81% (29/36) of patients. Unresolved leakage was associated with recurrent cancer (p=0.081) and TEP-site irregularity (p=0.003). Relative to those without enlargement, patients with enlarged TEP had a significantly higher risk of pneumonia (RR: 3.4, 95% CI: 1.9-6.2).

Conclusions. These data establish that enlarged TEP poses serious health risks and provide insight into medical and oncologic factors that may contribute to the development of this complication. In addition, this research supports the use of conservative treatments to address leakage after enlarged TEP in lieu of complete TEP closure.
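The summary risk estimate can be reproduced in form (not in the actual numbers, which come from the 27 included studies) with DerSimonian-Laird random-effects pooling of study proportions; pooling is done on the raw proportion scale here for simplicity, whereas published analyses often use a logit or arcsine transform:

```python
import numpy as np

# Hypothetical per-study enlarged-TEP counts (events, patients); the
# 27 studies from the review are not reproduced in the abstract.
events = np.array([4, 9, 2, 11, 6, 3, 8])
totals = np.array([60, 120, 45, 150, 90, 40, 110])

p = events / totals
v = p * (1 - p) / totals                 # within-study variance

# DerSimonian-Laird estimate of the between-study variance tau^2.
w_fixed = 1 / v
p_fixed = (w_fixed * p).sum() / w_fixed.sum()
Q = (w_fixed * (p - p_fixed) ** 2).sum()
c = w_fixed.sum() - (w_fixed ** 2).sum() / w_fixed.sum()
tau2 = max(0.0, (Q - (len(p) - 1)) / c)

# Random-effects pooling with weights 1 / (v_i + tau^2).
w = 1 / (v + tau2)
p_re = (w * p).sum() / w.sum()
se = np.sqrt(1 / w.sum())
lo, hi = p_re - 1.96 * se, p_re + 1.96 * se
print(f"pooled risk = {p_re:.3f} (95% CI {lo:.3f}-{hi:.3f}), "
      f"tau^2 = {tau2:.5f}")
```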

Relevance: 100.00%

Abstract:

A retrospective study of 1353 occupational injuries occurring at a chemical manufacturing facility in Houston, Texas from January 1982 through May 1988 was performed to investigate the etiology of the occupational injury process. Injury incidence rates were calculated for various sub-populations of workers to determine differences in the risk of injury across groups. Linear modeling techniques were used to determine the association between the collected independent variables and the severity of an injury event. Finally, two sub-groups of the worker population, shiftworkers and injury recidivists, were examined; an injury recidivist was defined as any worker experiencing one or more injuries per year. Overall, female shiftworkers showed the highest average injury incidence rate of all worker groups analyzed. Although the female shiftworkers were younger and less experienced, the etiology of their increased risk of injury remains unclear, though the rigors of shiftwork itself or ergonomic factors are suspect. In general, females were injured more frequently than males, but they did not incur more severe injuries. Across all workers, many injuries were caused by erroneous or forgone training and by risk-taking behaviors; injuries of these types are avoidable. The distribution of injuries by severity level was bimodal: injuries were either of minor or major severity, with only a small number of cases falling in between. Of the variables collected, only the type of injury incurred and the worker's title code were statistically significantly associated with injury severity. Shiftworkers did not sustain more severe injuries than other worker groups. Injury to shiftworkers varied in a 24-hour pattern; the greatest number occurred between 1200 and 1230 hours (p = 0.002) by cosinor analysis. Recidivists made up 3.3% of the population (23 males and 10 females), yet suffered 17.8% of the injuries. Although past research suggests that injury recidivism is a random statistical event, analysis of the data by logistic regression implicates gender, area worked, age and job title code as statistically significantly related to injury recidivism at this facility.
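Cosinor analysis fits a 24-hour rhythm by regressing counts on cosine and sine terms and recovering amplitude and peak time (acrophase). A minimal sketch on simulated times-of-day; the data and the peak location are invented:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Hypothetical injury counts by half-hour of day, standing in for the
# facility's shiftworker injury records (true peak set near 12:15).
hours = np.arange(0, 24, 0.5)
counts = (5 + 2 * np.cos(2 * np.pi * (hours - 12.25) / 24)
          + rng.normal(0, 0.5, hours.size))

# Single-component cosinor: y = M + A*cos(2*pi*t/24 - phi) is linear
# in cos and sin terms, so fit by OLS and recover amplitude and phase.
X = sm.add_constant(np.column_stack([
    np.cos(2 * np.pi * hours / 24),
    np.sin(2 * np.pi * hours / 24),
]))
fit = sm.OLS(counts, X).fit()
M, beta, gamma = fit.params
A = np.hypot(beta, gamma)                          # amplitude
phi_hours = (np.arctan2(gamma, beta) * 24 / (2 * np.pi)) % 24
print(f"MESOR = {M:.2f}, amplitude = {A:.2f}, "
      f"acrophase ~ {phi_hours:.1f} h")
print(f"zero-amplitude test p-value: {fit.f_pvalue:.4g}")
```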